Overlapping cDNAs have been isolated containing
all the coding sequences for Artemia salina protein
GRP33, a glycine-rich protein (16.6 mol % glycine),
with a molecular weight of 32,992. GRP33 is closely
related to HD40, the major protein component of
Artemia heterogeneous nuclear ribonucleoprotein
particles, and shares certain characteristics with other
RNA binding proteins. The C-terminal region (123 amino
acids) contains 39 glycine residues. This region has
multiple arginine residues flanked by glycines,
resembling the glycine-dimethylarginine clusters present in
other RNA binding proteins. Secondary structure
predictions for the protein reveal two distinct domains: a
hydrophilic C-terminal domain with an extended
conformation and a larger N-terminal domain with a
number of α-helices and β-sheets.


In eukaryotic cells, heterogeneous nuclear RNA is associated
with a defined set of nuclear proteins to form
ribonucleoprotein particles or complexes (hnRNPs), which can be
recovered from purified nuclei as substructures with a
relatively homogeneous sedimentation coefficient of 30-40 S (1,
2). A major fraction of the proteins from these particles
consists of a class of immunologically cross-reactive peptides
with molecular weights between 30,000 and 45,000 (3, 4). The
amino acid compositions of these proteins are similar,
characterized by a high content of glycine (about 20%), very few
cysteines, blocked amino termini, and the presence of the
modified amino acid dimethylarginine (2, 5-10). hnRNP
proteins sharing these characteristics have been found in many
divergent species among vertebrates: duck, hamster, mouse,
human, as well as in plants, and have been termed “glycine-rich”
proteins or “core” hnRNP proteins (6, 10).

Little is known about the specific function of the core
proteins; they are thought to be involved in the packaging of
heterogeneous nuclear RNA. hnRNP assembly takes place
immediately after transcription, and antibodies against these
proteins have been shown to inhibit splicing in vitro (11, 12),
suggesting that core proteins are present in the splicing
complexes and are important for RNA processing.

The core proteins in the 30 S particles isolated from several
different cell types appear as three major groups (A, B, and
C) of closely spaced doublets on SDS-PAGE (2, 13, 14).
Two-dimensional gel analysis reveals further complexity, showing
that some of these proteins contain several differently charged
species (6, 15, 16).

The 30 S particles from the brine shrimp Artemia salina
seem to have a relatively simple protein composition (17).
The major protein component has been purified to
homogeneity (18). It is a helix-destabilizing protein with an M_r of
about 40,000 and has been designated HD40. It binds strongly
to single-stranded nucleic acids, forming complexes which are
strikingly similar to the native “beads on a string” structures
of hnRNPs (18, 19). The biochemical characteristics of HD40
(high glycine content, very little cysteine, presence of
dimethylarginine, and a blocked amino terminus) suggest that this
protein is a functional analogue of the hnRNP core proteins
from higher eukaryotes (17, 18). Immunoelectrophoresis with
a polyclonal antibody raised in rabbits against HD40 reveals
the presence of at least three different isoelectric forms of
HD40 and three or four other antigenically related proteins
(M_r 30,000-40,000) in Artemia 30 S particles (17).

As an initial approach to studying hnRNP proteins and
their function, we undertook the cloning of an A. salina
hnRNP protein. We have previously reported the cloning of
a partial cDNA for such a protein using an anti-HD40
antibody (20). Southern blot analysis detected sequences
homologous to the cloned cDNA across eukaryotes, from yeast to
human, suggesting the conservation of these proteins through
evolution.

We describe here the isolation of overlapping cDNAs
corresponding to the full-length transcript and the deduced
complete amino acid sequence of the encoded protein. Analysis of
this sequence and comparison with the only sequence of an
hnRNP core protein published so far (21-23) provides some
insight into the conserved structural features of these RNA
binding proteins.

EXPERIMENTAL PROCEDURES 2 
RESULTS 

In order to obtain a full-length cDNA, a new cDNA library
was constructed in λgt11 from Artemia total poly(A)+ RNA


2 Portions of this paper (including “Experimental Procedures” and
Chemistry, 9650 Rockville Pike, Bethesda, MD 20814. Request
Document No. 87C-203, cite the authors, and include a check or money
order for $2.00 per set of photocopies. Full size photocopies are also
included in the microfilm edition of the Journal that is available from
Waverly Press.
in addition to the one described previously (20). About 10^4
recombinant phage were screened with a restriction fragment
from the previously isolated clone 87HD (20) as a probe. One
positive phage, termed λHD-1, was identified, and this was
cloned by plaque purification and two cycles of rescreening.
λHD-1 DNA contained an insert of approximately 1000 base
pairs, which was found to overlap 87HD cDNA over 400 base
pairs (Fig. 1A). Since there were approximately 1000 base
pairs that had only been cloned once, either in 87HD or in
λHD-1, their authenticity was checked by the hybrid selection assay
and by Northern blot hybridization in order to rule out
possible cloning artifacts in these regions. All restriction fragments
along the λHD-1 insert hybrid-selected an RNA which was
translated in vitro into a protein of approximately the same
mobility as HD40 on SDS-PAGE. Moreover, the in vitro
translated protein was immunoprecipitated with anti-HD40
antibodies. In Northern blots the different labeled restriction
fragments and clone 87HD detected an identical poly(A)+
RNA species of 1300-1400 nucleotides (data not shown).
These results indicate that the entire insert in λHD-1
represents a cDNA derived from reverse transcription of a single
mRNA. When restriction fragments along the cDNA insert
in 87HD were analyzed in the same way, the 5'-most
restriction fragment failed to hybridize to this RNA. Sequencing of
this fragment revealed the presence of a poly(A) tail at its 5'
end, suggesting that it is derived from the adventitious joining
of another cDNA in opposite orientation.

Another cDNA library was then constructed in order to
obtain a cDNA containing the correct 5' end. Oligonucleotides
1 and 2 (Fig. 1A) were used as probes for the screening of the
library. It should be pointed out that the oligonucleotides were
contained in a restriction fragment known to hybrid-select
the correct mRNA and that both of them were shown to be
complementary to the same mRNA by Northern blot
hybridization (not shown). About 10^4 recombinant phage plaques
were screened, and 13 positive phage were identified with both
oligonucleotides. Six of these, λ1HD-2 through λ1HD-7, were
cloned by plaque purification and rescreening.

The cDNA inserts in λ1HD2 to λ1HD7 were then subcloned
into M13mp8 and/or M13mp9 and sequenced by the
dideoxynucleotide chain terminator method (31). Five of these
cDNAs, those in λ1HD2-1HD6, were found to be identical,
while the cDNA in λ1HD7 was a few bases shorter. The
overlap of one of the five identical cDNAs with the insert in
clone 87HD is shown in Fig. 1A. Restriction fragments from
the inserts in 87HD and λHD-1 were also subcloned into
M13mp8 and/or M13mp9 and sequenced. The sequencing
strategy is shown in Fig. 1B.

Primer extension analysis of total Artemia poly(A)+ RNA
(Fig. 2) shows a single band in a polyacrylamide-urea gel when
oligonucleotide 1 was elongated with reverse transcriptase;
this band corresponds to the addition of about 132 bases to
the primer, the same length of extension seen on the five
identical cDNAs isolated (λ1HD2-1HD6). This result
indicates that these cDNAs contain the cap site of the RNA.

The entire nucleotide sequence derived from the three
overlapping cDNAs is shown in Fig. 3. The presence of a
60-base-long poly(A) stretch and a polyadenylation signal 14
bases upstream indicates that this fragment corresponds to
the 3' end of the mRNA. The sequence in Fig. 3 represents,
therefore, that of a full-length cDNA. The length of the
mRNA is 1208 bases without the poly(A) tail, in reasonable
agreement with the size that had been previously estimated
from its mobility in denaturing agarose gels and methyl
mercury hydroxide-sucrose gradients (20). The possible open
reading frames were analyzed; the longest one contains 939
bases, from base 28 to base 966 as numbered in Fig. 3. The
two other reading frames contain a large number of
termination codons.

Fig. 3. Full-length cDNA sequence and amino acid
sequence deduced from it. The whole cDNA sequence is shown
starting at the cap site and including 10 A residues of the poly(A)
tail. The polyadenylation signal is underlined. The sequence
corresponding to the 5'- and 3'-untranslated regions of the RNA is
indicated by lower case letters. Glycine-arginine clusters are shown in
boxes. Aromatic amino acids situated no more than 6 residues from
the glycine-arginine groups are indicated by arrowheads.
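The reading-frame analysis described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequence are hypothetical, the GRP33 cDNA sequence itself is not reproduced in this text, and an "open reading frame" is taken here to mean a stop-free run of in-frame codons, which is the sense in which the 939-base frame (bases 28-966) is quoted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_open_frame(seq, frame):
    """Longest stop-free run of codons in one forward frame (0, 1, or 2).

    Returns (first_base, last_base, length) using 1-based inclusive
    coordinates, the convention used for positions in the text.
    """
    seq = seq.upper().replace("U", "T")
    best = (0, 0, 0)
    run_start = frame              # 0-based index where the current run begins
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos - run_start > best[2]:
                best = (run_start + 1, pos, pos - run_start)
            run_start = pos + 3    # the next run starts after the stop codon
        pos += 3
    if pos - run_start > best[2]:  # run that reaches the last complete codon
        best = (run_start + 1, pos, pos - run_start)
    return best

def longest_open_frame_any(seq):
    """Best stop-free run over the three forward frames."""
    return max((longest_open_frame(seq, f) for f in range(3)),
               key=lambda r: r[2])

# Hypothetical toy sequence, not the GRP33 cDNA.
toy = "CCTAAATGGCTGGTTGAC"
print(longest_open_frame(toy, 2))
print(longest_open_frame_any(toy))

# Consistency check on the figures quoted in the text:
# bases 28 through 966 inclusive span 939 bases.
print(966 - 28 + 1)
```

Run on the real sequence, a scan of this kind would be expected to report one long frame and only short stop-free runs in the other two frames, as stated above.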

The deduced amino acid sequence starting at the first AUG
codon from the 5' end of the mRNA is also shown in Fig. 3.
This first AUG is found in the consensus context (AXXATGG
(32)) and has been designated as the initiation site for
translation. Two in-frame termination codons (positions 22-24 and
25-27) are found upstream of this initiation codon. There are
924 bases to the first in-phase termination codon,
corresponding to an open reading frame coding for 308 amino acids.
Based on the deduced amino acid composition, the molecular
weight of the protein would be 32,992. In view of its high
glycine content (16.6%), it has been termed GRP33
(glycine-rich protein, M_r 33,000).
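As a rough illustration of how a molecular weight and a glycine content follow from a deduced sequence, the sketch below sums average residue masses and adds one water for the free termini. The residue masses are standard average values supplied here, not taken from the paper; the function names and the toy peptide are hypothetical, and the GRP33 sequence is not reproduced in this text.

```python
# Average residue masses (Da); standard values, supplied as an assumption.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain for the free N and C termini

def molecular_weight(protein):
    """Average molecular weight of an unmodified polypeptide."""
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER

def mol_percent(protein, residue="G"):
    """Mol % of one residue type in the sequence."""
    protein = protein.upper()
    return 100.0 * protein.count(residue) / len(protein)

# Hypothetical toy peptide, used only to exercise the functions.
toy = "GRGGFGGRGGYGGRG"
print(round(molecular_weight(toy), 1), round(mol_percent(toy), 1))

# Consistency check on the figures in the text: 924 coding bases
# correspond to 308 codons.
print(924 // 3)
```

A calculation of this kind ignores post-translational modifications such as arginine methylation or a blocked amino terminus, so it gives only the mass of the unmodified chain.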

As shown in Fig. 4, the predicted secondary structure (33)
for GRP33 reveals two distinct domains. The N-terminal
domain contains several possible regions of α-helix and
β-sheet; the smaller C-terminal domain would be expected to
have an extended conformation. The hydropathy index has
been determined along the GRP33 sequence (34) and, as shown
in Fig. 4, the C-terminal region is essentially hydrophilic.
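A hydropathy profile of the kind referred to above is usually a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a 9-residue window; the text does not restate the method of ref. 34 or the window used, so both are assumptions, and the function name and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Mean hydropathy of every full window along the sequence.

    Returns one value per window start; an empty list if the sequence
    is shorter than the window.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(protein) - window + 1)]

# Hypothetical Gly/Arg-rich stretch, not the GRP33 sequence.
toy = "GRGGFGGRGGYGGRG"
for start, value in enumerate(hydropathy_profile(toy), start=1):
    print(start, round(value, 2))
```

With this scale, windows dominated by glycine and arginine score below zero, which is the sense in which a Gly-Arg-rich region reads as hydrophilic.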

DISCUSSION 

We have previously used a polyclonal antibody raised in
rabbits against a major protein component of A. salina
hnRNPs to screen a cDNA library from the same species. A
partial cDNA clone had been identified containing sequences
coding for a protein which appeared to be identical to HD40
according to several criteria: the same electrophoretic mobility
on SDS-PAGE, common antigenic determinant(s), and
virtually identical products of partial proteolysis (20). We have
now obtained overlapping cDNAs containing all the coding
sequences. However, the deduced molecular weight of the
protein is 33,000 rather than about 40,000. This discrepancy
can be explained by the fact that the in vitro translated protein
shows an anomalously slow mobility on SDS-PAGE and is
only distinguishable from the slower migrating HD40 upon
electrophoresis on long gels. GRP33 and HD40 are in any
case different proteins, since several tryptic peptides of HD40
have been recently sequenced and are not present in GRP33.
There are several antigenically related proteins in 30 S
hnRNP particles of Artemia, since antibodies raised against
purified HD40 recognize on immunoelectrophoresis three or
four other proteins of slightly smaller molecular weights than
HD40 (17). Western blot analysis of a whole cell extract
prepared from developed Artemia cysts also shows the
existence of several proteins cross-reacting with the anti-HD40
antibody used for the screening of the cDNA library (data not
shown). The relationship between these proteins is not yet
known. A similar situation prevails with respect to
antigenically and biochemically related groups of hnRNP core
proteins present in other species (3, 4, 6, 16). They may represent
post-translational modifications of a single gene product,
alternative splicing products, or products of related genes. GRP33
may then be the precursor of one or more of the Artemia proteins
which cross-react with anti-HD40 antibodies.

GRP33 shares with a number of hnRNP core proteins what
seems to be one of their typical conserved features, a high
glycine content, 16.6 mol %, whereas the average frequency
of glycine in eukaryotic proteins is 7.6 mol % (PIR Protein
Sequence Database). Furthermore, 76.5% of the glycine
residues are within the 123-amino acid C-terminal region of the
protein. The only complete sequence of a core hnRNP protein
published so far, that of rat A1, shows the same unequal
distribution of glycine, with 76.9% of the total glycine residues
in the 124-amino acid C-terminal region of the molecule (21).
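The glycine figures quoted for GRP33 can be cross-checked with simple arithmetic. The sketch below uses only numbers given in the text (308 residues, 16.6 mol % glycine, 39 glycines in the 123-residue C-terminal region); the total glycine count and the per-region densities are derived here by rounding and are not stated in the source.

```python
total_residues = 308      # length of the deduced protein
glycine_mol_pct = 16.6    # overall glycine content (mol %)
c_term_residues = 123     # length of the C-terminal region
c_term_glycines = 39      # glycines in that region (from the summary)

# Derived value: total glycine count implied by the overall content.
total_glycines = round(total_residues * glycine_mol_pct / 100)

# Share of all glycines that fall in the C-terminal region.
c_term_share = 100 * c_term_glycines / total_glycines

# Derived glycine density of each region (mol %).
c_term_density = 100 * c_term_glycines / c_term_residues
n_term_density = (100 * (total_glycines - c_term_glycines)
                  / (total_residues - c_term_residues))

print(total_glycines)
print(round(c_term_share, 1))
print(round(c_term_density, 1), round(n_term_density, 1))
```

Under these assumptions the C-terminal share comes out at the 76.5% quoted above, and the derived densities suggest that the glycine enrichment is confined to the C-terminal region, with the remainder of the protein close to the eukaryotic average.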

The C-terminal region of GRP33 also shows an unusual
content of arginines (10.6%), which are clustered with glycine
residues (Fig. 3) and might be methylated in vivo. Two other
nuclear proteins (rat A1 and nucleolin) show similar Gly-Arg
clusters in the C-terminal regions of the molecules (21, 35).
Nucleolin, a 110,000 M_r protein which resembles the hnRNP
core proteins with respect to the presence of
dimethylarginines and a high glycine content, seems to be associated with
preribosomal RNA in the nucleus (35). In nucleolin, as in
other proteins containing dimethylarginines, e.g. a
34,000-dalton nuclear scleroderma antigen from hepatoma cells (36)
and human myelin basic protein (37), most of the methylated
arginines are surrounded by glycines, suggesting that an
adjacent glycine might be required for the methylation of an
arginine.

Relatively close to the Gly-Arg groups in this C-terminal
domain are several aromatic amino acids: phenylalanine,
tryptophan, and tyrosine (Fig. 3). Such aromatic amino acids have
been shown to be involved in the binding of some proteins to
single-stranded nucleic acids through intercalation of the
aromatic residues with the nucleotide bases (38).
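The two sequence features discussed above, arginines next to glycines and aromatic residues close to them, can be located with a simple scan. In this sketch an arginine counts as glycine-flanked if at least one immediate neighbour is glycine, which is an assumption about how the clusters were defined; the 6-residue distance follows the legend to Fig. 3. The function names and toy sequence are hypothetical.

```python
AROMATIC = set("FWY")  # phenylalanine, tryptophan, tyrosine

def gly_flanked_arginines(protein):
    """1-based positions of arginines with a glycine on either side."""
    protein = protein.upper()
    hits = []
    for i, aa in enumerate(protein):
        if aa != "R":
            continue
        left = i > 0 and protein[i - 1] == "G"
        right = i + 1 < len(protein) and protein[i + 1] == "G"
        if left or right:
            hits.append(i + 1)
    return hits

def aromatics_near(protein, sites, max_dist=6):
    """(position, residue) for aromatics within max_dist of any site."""
    protein = protein.upper()
    return [(i, aa) for i, aa in enumerate(protein, start=1)
            if aa in AROMATIC and any(abs(i - s) <= max_dist for s in sites)]

# Hypothetical toy sequence: one Gly-Arg cluster, a nearby Phe,
# and a distant Tyr that should not be reported.
toy = "GRGGF" + "A" * 10 + "Y" + "AAA"
sites = gly_flanked_arginines(toy)
print(sites)
print(aromatics_near(toy, sites))
```

Requiring glycine on both sides instead (replacing `left or right` with `left and right`) gives a stricter definition; which one matches the boxed clusters in Fig. 3 cannot be decided from this text alone.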

The structural characteristics of the C-terminal domain of
GRP33, a mostly hydrophilic region with a predicted extended
conformation, are shared by two other RNA binding proteins
(rat A1 (21) and nucleolin (35)) and are consistent with this
region being on the exterior of the molecule and perhaps
capable of interacting with nucleic acid.

Recent studies demonstrate that there is a close
relationship between eukaryotic single-stranded DNA binding
proteins and hnRNP proteins (22, 39). It has been shown that
the sequence of the calf thymus single-stranded DNA binding
protein, UP1, is identical to that of the 195-amino-acid-long
N-terminal domain of hnRNP protein A1 (21-23). Despite
the absence of the C-terminal glycine-rich domain, UP1
retains the ability to bind single-stranded nucleic acids,
particularly DNA, suggesting that the C-terminal domain may
modulate the specificity of the protein to bind RNA over
single-stranded DNA.

A similarity search (40) with the National Biomedical
Research Foundation Protein Data Base showed the highest
homology with an Epstein-Barr virus nuclear antigen (41):
33.6% identity with GRP33 in a 119-amino-acid overlap. Both
the N- and C-terminal regions of the Epstein-Barr protein
contain repeating Gly-Arg units, and the protein has a high
affinity for single-stranded DNA.
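Percent identity over an overlap, as quoted above, is the number of identical aligned positions divided by the overlap length. The sketch below is a minimal illustration for two sequences that have already been aligned to equal length; it is not the search method of ref. 40, and the function name and toy alignment are hypothetical. The count of 40 identical residues is derived here from the quoted 33.6% and is not stated in the source.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns holding the same residue.

    Both inputs must already be aligned and of equal length;
    a column containing a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != gap)
    return 100.0 * matches / len(aligned_a)

# Hypothetical toy alignment.
print(percent_identity("GRGG", "GRAG"))

# Derived check: 40 identities in a 119-residue overlap round to the
# 33.6% quoted in the text.
print(round(100 * 40 / 119, 1))
```

Low-complexity repeats such as Gly-Arg units can raise identity scores between otherwise unrelated proteins, so a figure of this size is best read alongside the shared repeat structure noted above.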

In view of the scarcity of sequence information, the
complete sequence of this glycine-rich protein, which shares
structural features with some nuclear RNA binding polypeptides,
should further the understanding of protein-nucleic acid and
protein-protein interactions in hnRNPs.

Note Added in Proof: After the submission of our paper, the
complete sequence and primary structure of a human nuclear
ribonucleoprotein particle C protein were published by Swanson et al.
(Swanson, M. S., Nakagawa, T. Y., LeVan, K., and Dreyfuss, G.
(1987) Mol. Cell. Biol. 7, 1731-1739).